Connecting SOC with RL – Importance sampling

AI as a tool in Mathematics

Alonso Cisneros

Freie Universität Berlin

[Diagram: Stochastic Optimal Control, Uncertainty Quantification, Reinforcement Learning, and Importance Sampling, connected through optimal biasing ("the small leap").]

Outline

  1. Crash course on RL

  2. What is importance sampling

     • The connection to optimization
     • Optimal biasing

  3. Optimal biasing as an RL problem

Crash course on Reinforcement Learning

A Miniopoly board
  • The game has a state at turn \(t\), denoted \(s_t\)
  • At each turn \(t\), players roll the dice
  • The change in money after buying, paying rent, or charging rent is recorded as a reward \(r_t\)

Important

We train our robot to maximize the cumulative reward as it takes actions exploring the space of states

Can you describe the state space?

What does it look like?

It’s hard to describe the state space, but we can study the dynamics


What if we don’t know how square transitions work?

  • We calculated transition probability with the knowledge of the dice

5 minutes to think

  • What is a reasonable way of guessing transition probabilities?
  • Can I be sure to observe even improbable but still possible states?

The right answers are:

  • By simulating the transitions and collecting empirical estimates
  • No
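The first answer can be sketched as a minimal simulation. A two-dice move rule is my assumption here; any Miniopoly transition rule works the same way:

```python
import random
from collections import Counter

def empirical_move_distribution(n_rolls=100_000, seed=0):
    """Estimate the distribution of a two-dice move by simulation."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return {move: c / n_rolls for move, c in sorted(counts.items())}

probs = empirical_move_distribution()
# The empirical estimates concentrate around the true probabilities,
# e.g. P(move = 7) = 6/36 ≈ 0.167 — but a rare state may simply never
# be visited in a finite run, which is the second answer above.
```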

Markov Chain Monte Carlo

  • We let the robot roam around and buy squares as it pleases
    • For any square, it can either buy it or not
  • We register what it gained or lost by buying or not buying a square by the end of the game.


What we’ve covered so far

Importance Sampling

  • We wanted to compute the expected reward of the robot after the entire game
  • Not every problem is this well behaved: some systems linger in one region of state space for a very long time before a rare transition occurs
  • This property is called metastability
  • Importance sampling aims to remedy this

Important

The general idea of importance sampling is to draw random variables from another probability measure and subsequently weight them back, in order to still obtain an unbiased estimator of the desired quantity of interest
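As a toy illustration of this idea (the Gaussian tail probability is my example, not part of the slides): to estimate \(P(Z > 4)\) for a standard normal, naive sampling almost never observes the event, but sampling from a normal shifted to the rare region and reweighting each sample by the likelihood ratio recovers an unbiased, low-variance estimate.

```python
import math
import random

def tail_prob_importance_sampling(a=4.0, n=100_000, seed=0):
    """Estimate P(Z > a) for Z ~ N(0, 1) by sampling Y ~ N(a, 1)
    and reweighting with the likelihood ratio dN(0,1)/dN(a,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = rng.gauss(a, 1.0)                      # draw from the biased measure
        if y > a:                                  # indicator of the rare event
            total += math.exp(-a * y + a * a / 2)  # likelihood ratio weight
    return total / n

est = tail_prob_importance_sampling()
exact = 0.5 * math.erfc(4 / math.sqrt(2))  # ≈ 3.17e-5
```

With naive sampling, only about 3 in 100,000 draws would even hit the event; under the biased measure, half of them do, and the weights restore unbiasedness.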

  • We want to explore space
    • We can’t make it to the other well
  • Jumps across the barrier become exponentially less probable as the barrier height grows
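A minimal Euler–Maruyama simulation of an overdamped Langevin particle in a double-well potential \(V(x) = (x^2 - 1)^2\) (my choice of potential and parameters) makes the metastability visible: started in the left well, the trajectory essentially never reaches the right well on this time scale.

```python
import math
import random

def simulate_langevin(n_steps=5_000, dt=0.01, sigma=0.5, x0=-1.0, seed=0):
    """Euler–Maruyama for dX = -V'(X) ds + sigma dW, V(x) = (x^2 - 1)^2."""
    rng = random.Random(seed)
    grad_v = lambda x: 4 * x * (x * x - 1)  # V'(x)
    x, path = x0, []
    for _ in range(n_steps):
        x += -grad_v(x) * dt + sigma * math.sqrt(dt) * rng.gauss(0, 1)
        path.append(x)
    return path

path = simulate_langevin()
# Fraction of time spent in the left well (x < 0): close to 1 here,
# since crossings are exponentially rare in the barrier height.
frac_left = sum(1 for x in path if x < 0) / len(path)
```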

More formally…

  • The metastable stochastic system we sample from follows Langevin dynamics \[\begin{equation} \mathrm{d}X_s = - \nabla V(X_s) \, \mathrm{d}s + \sigma(X_s) \, \mathrm{d}W_s \end{equation}\]
  • We want to hit a target set \(\mathcal{T}\). We define \[\begin{equation} \tau = \inf \{ s > 0 \mid X_s \in \mathcal{T} \} \end{equation}\]
  • We’re interested in computing \(I: C([0, \infty), \mathbb{R}^d) \to \mathbb{R}\) \[\begin{equation} I(X) \coloneqq \exp(- \mathcal{W}(X)), \end{equation}\] where the path functional \(\mathcal{W}(X) \coloneqq \int_0^{\tau} f(X_s) \, \mathrm{d}s + g(X_\tau)\) collects a running cost \(f\) and a terminal cost \(g\)

Our main goal is to compute \[ \Psi(x) \coloneqq \mathbb{E}^x [I(X)] \coloneqq \mathbb{E}[I(X) \mid X_0 = x] \]

But…

  • MCMC has terrible properties because of metastability
  • No closed form exists

Tip

  • We can “push” the particle by adding a force, as long as we account for it and correct for the bias it introduces
  • That “push” is achieved by adding a control \(u\).

The new, controlled dynamics are now described as \[\begin{equation} \label{eq: controlled langevin sde} \mathrm dX_s^u = (-\nabla V(X_s^u) + \sigma(X_s^u) \, u(X_s^u))\mathrm ds + \sigma(X_s^u) \mathrm dW_s, \qquad X_0^u = x \end{equation}\]

Via Girsanov, we can relate our QoI to the original as such: \[\begin{equation} \label{eq: expectation IS} \mathbb{E}^x\left[I(X)\right] = \mathbb{E}^x\left[I(X^u) M^u\right], \end{equation}\]

where the exponential martingale \[\begin{equation} \label{eq: girsanov martingale} M^u \coloneqq \exp{\left(- \int_0^{\tau^u} u(X_s^u) \cdot \mathrm dW_s - \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right)} \end{equation}\] corrects for the bias the pushing introduces.
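This reweighting can be checked numerically in discrete time. The setup below is my own toy choice (an Ornstein–Uhlenbeck drift \(V(x) = x^2/2\), a constant control, and a fixed horizon instead of a stopping time): the controlled-and-reweighted estimator agrees with the uncontrolled one, as Girsanov promises.

```python
import math
import random

def estimate(u=0.0, n_paths=50_000, n_steps=50, dt=0.02, sigma=1.0, seed=1):
    """Monte Carlo estimate of E[exp(-X_T^2)] for dX = -X ds + sigma dW,
    simulated under a constant control u and reweighted by the discrete
    Girsanov martingale M^u = exp(-sum u dW - 0.5 sum u^2 dt)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x, log_m = 0.0, 0.0
        for _ in range(n_steps):
            dw = math.sqrt(dt) * rng.gauss(0, 1)
            log_m += -u * dw - 0.5 * u * u * dt     # accumulate log M^u
            x += (-x + sigma * u) * dt + sigma * dw  # controlled Euler step
        total += math.exp(-x * x) * math.exp(log_m)
    return total / n_paths

plain = estimate(u=0.0)   # uncontrolled estimator
pushed = estimate(u=0.5)  # biased dynamics, corrected by M^u
```

For Gaussian Euler increments the discrete likelihood ratio is exact, so the two estimates differ only by Monte Carlo noise.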

Important

The previous relationship always holds. But the variance of the estimator depends heavily on the choice of \(u\).

Clearly, we aim to achieve the smallest possible variance through an optimal control \(u^*\) \[\begin{equation} \label{eq: variance minimization} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = \inf_{u \in \mathcal{U}} \left\{ \operatorname{Var} (I(X^u) M^u) \right\} \end{equation}\]

Connection to optimization

It turns out (Quer and Ribera Borrell 2024) that the problem of minimizing the variance corresponds to a problem in optimal control

The cost functional \(J\) to find the variance minimizing control is \[\begin{equation} \label{eq: cost functional} J(u; x) \coloneqq \mathbb{E}^x\left[\mathcal{W}(X^u) + \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right], \end{equation}\]

With this formulation, the value function \(\Phi(x) \coloneqq -\log \Psi(x)\) satisfies \[\begin{equation} \Phi(x) = \inf_{u \in \mathcal{U}} J(u; x). \end{equation}\]

Important

The optimal bias achieves zero variance: \[\begin{equation} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = 0. \end{equation}\]

Optimal biasing through RL

  • Let’s reconsider the SOC problem (excuse the change in notation)
  • We discretize the dynamics equation \[\begin{align*} s_{t+1} &= s_t + \left( -\nabla V(s_t) + \sigma u(s_t)\right) \Delta t + \sigma \, \sqrt{\Delta t} \, \eta_{t+1} \\ s_0 &= x \end{align*}\]

The time-discretized objective function is given by \[\begin{equation} \small J(u; x) \coloneqq \mathbb{E}^{x} \left[ g(s_{T_u}) + \sum_{t=0}^{T_u - 1} f(s_t) \Delta t + \frac{1}{2} \sum_{t=0}^{T_u - 1} |u(s_t)|^2 \Delta t \right] \end{equation}\]

  • Our stopping time \(\tau\) is now denoted \(T_u\)
  • The return we want to optimize depends on a reward function \[ r_t = r(s_t, a_t) \coloneqq \begin{cases} - f(s_t) \Delta t - \frac{1}{2} |a_t|^2 \Delta t & \text{if } s_t \notin \mathcal{T} \\ -g(s_t) & \text{if } s_t \in \mathcal{T} \end{cases} \]
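The discretized dynamics and reward above can be packaged as a gym-style episodic environment. This is a minimal sketch; the double-well potential, the target set \(\mathcal{T} = [1, \infty)\), and the choices \(f \equiv 1\), \(g \equiv 0\) are my assumptions for illustration:

```python
import math
import random

class LangevinEnv:
    """Discretized controlled Langevin dynamics as an episodic RL environment.
    Double-well V(x) = (x^2 - 1)^2, target set T = [1, inf), f = 1, g = 0."""

    def __init__(self, dt=0.01, sigma=1.0, x0=-1.0, seed=0):
        self.dt, self.sigma, self.x0 = dt, sigma, x0
        self.rng = random.Random(seed)

    def reset(self):
        self.s = self.x0
        return self.s

    def step(self, a):
        grad_v = 4 * self.s * (self.s ** 2 - 1)          # V'(s)
        noise = math.sqrt(self.dt) * self.rng.gauss(0, 1)
        self.s += (-grad_v + self.sigma * a) * self.dt + self.sigma * noise
        reward = -(1.0 + 0.5 * a * a) * self.dt          # -f dt - |a|^2 dt / 2
        done = self.s >= 1.0                             # episode ends in T
        # g = 0 here, so no extra terminal reward on reaching T.
        return self.s, reward, done

env = LangevinEnv()
s = env.reset()
s, r, done = env.step(0.5)  # one Euler step with control a = 0.5
```

Maximizing the return of this environment is exactly minimizing the time-discretized cost \(J\), which is how the SOC problem becomes an RL problem.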

References

Quer, J., and Enric Ribera Borrell. 2024. “Connecting Stochastic Optimal Control and Reinforcement Learning.” Journal of Mathematical Physics 65. https://doi.org/10.1063/5.0140665.
Ribera Borrell, Enric, Jannes Quer, Lorenz Richter, and Christof Schütte. 2024. “Improving Control Based Importance Sampling Strategies for Metastable Diffusions via Adapted Metadynamics.” SIAM Journal on Scientific Computing 46 (2): S298–323. https://doi.org/10.1137/22M1503464.